Fix some threading issues (some free-threading specific) by seberg · Pull Request #2162 · NVIDIA/cuda-python

seberg · 2026-06-02T16:33:02Z

This fixes a few threading issues, but we may want to discuss some details still.

The GraphNode cleanup order is an important fix: Another thread may end up with the same pointer (but new object) as soon as we clean it up. So we have to remove it from the cache before cleaning it up.
Use of atomics: I think this is needed, but for this one place an atomic seemed more reasonable. (However, hard to test and if it can fail IIUC only on ARM.)
(A critical section would work as well, since it is a lock which enforces order.)
The critical sections should be pretty safe. I am not sure they will all ensure that the object is always the identity but I am pretty sure it protects from worse races.
pytest-run-parallel did find this for MemPool.attributes and I could not reproduce after adding it. The other critical sections are added with a code-search, though.
The split mutex: This is thread-unsafe and failed threaded testing. But I am honestly not sure if that isn't just expected, or whether the mutex is good but it should also be safe from within CUDA?
Use of setdefault cached pattern is largely just normalizing. Without the return dict.setdefault a different instance may be returned on different threads (or a cache entry replaced). For the cyGraphMemoryResource that triggered a test with pytest-run-parallel although that doesn't mean it is problematic as such. cuda-pathfinder uses functools.cache, but usually for strings; the one we may want to look at is load_nvidia_dynamic_lib.

Checklist

Implementing gh-2114 would add implicit coverage for most but the atomics and some of the critical sections (if it adds it might be so rare to effectively never trigger). So this is split out.
(If preferred I can add some explicitly threaded tests.)

I have no full runs with pytest-run-parallel, but I want to clean up the test harness a bit more. I also still should test without free-threading (useful to flush out forgotten with nogil: blocks that can cause deadlocks).

copy-pr-bot · 2026-06-02T16:33:06Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

seberg · 2026-06-02T16:34:17Z

+            if _can_use_structured_sm_split():
+                return _split_with_general_api(self, opts, dry_run)
+            # SplitByCount requires the same 12.4+ as green ctx support (already checked above)
+            return _split_with_count_api(self, opts, dry_run)


Pointing this out inline, because this is probably the thing I am least sure about. Should SMResource.split() be thread-safe if called on the same resource from multiple threads?

mdboom · 2026-06-10T14:11:28Z

This should probably be a separate PR, but are there best practices for CI to help ensure these sorts of issues don't come back? Should we be running pytest-run-parallel on our 3.14t configurations, for example?

seberg · 2026-06-10T14:19:31Z

Should we be running pytest-run-parallel on our 3.14t configurations, for example?

Yeah, I just thought I'd split out this PR first and get it in and then open a PR for that. Running pytest-run-parallel is +maybe not a huge PR but also a non-trivial PR because we probably want a "mini pytest plugin" for init_cuda style logic.

(ideally we use pytest-run-parallel together with a TSAN environment, I haven't actually done TSAN yet locally. Since we have ASAN jobs, I think that may well be very doable.)

This fixes a few threading issues, but we may want to discuss some details still. * The GraphNode cleanup order is an important fix. Another thread may end up with the same pointer (but new object) as soon as we clean it up. So we have to remove it from the cache before cleaning it up. * Use of atomics: I think this is needed, but for this one place an atomic seemed more reasonable. (However, hard to test and if it can fail IIUC only on ARM.) * The critical sections should be pretty safe. I am not sure they will all ensure that the object is always the _identity_ but I am pretty sure it protects from worse races. (Testing did find this for MemPool.attributes, not others yet. Testing with thread-sanitizer might flush out some...) * The split mutex: This is thread-unsafe. But I am honestly not sure if that isn't just expected, or whether the mutex is good but it should also be safe from within CUDA. * Use of `setdefault` cached pattern is largely just normalizing. Without the `return dict.setdefault` a different instance may be returned on different threads (or a cache entry replaced). For the `cyGraphMemoryResource` that triggered a test with pytest-run-parallel although that doesn't mean it is problematic as such. `cuda-pathfinder` uses functools.cache, but usually for strings; the one we may want to look at is `load_nvidia_dynamic_lib`. Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>

…s good... Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>

rparolin · 2026-06-12T23:56:09Z

It would be great if at least the critical section and atomics changes were broken out into a separate PR.

seberg · 2026-06-15T10:19:08Z

OK, I split out most things and will update after that (or split out the final chunks), until then marking as draft.

What is remaining is the GraphNode cleanup order and the mutex for SMResource.split().

Andy-Jost

Just commenting (not approving) because this PR is marked draft (I think it will be split).

I'm fine with these fixes as they are, though I'd also like to explore more structural fixes to decrease the likelihood problems are reintroduced. None of that blocks this change for me. Thematically, I think it amounts to moving more things into the C++ layer. Some ideas below from my bot (where Level 1 and Level 2 refer to the registry levels documented in REGISTRY_DESIGN.md):

There are more structural options, and they're largely aligned with where the handle layer was already heading. This PR is tactical patching; the strategic fix is centralize lifetime + identity + lazy state in C++ boxes/factories, and make Cython thin.

Push lazy-init into Boxes (replaces atomics/critical sections for CUDA metadata)
A lot of what got @cython.critical_section / atomics is really "query driver once per handle": Buffer mem attrs, pool attrs, kernel attrs. That belongs in the box next to the resource, with std::once_flag or a box mutex:

  struct DevicePtrBox {
      CUdeviceptr resource;
      mutable MemAttrs attrs;
      mutable std::once_flag attrs_init;
  };

Cython properties become return box->get_attrs() — no Python-side check-then-act. This matches the existing handle design: state lives with ownership, not with the Python wrapper.

Unified C++ factory + destroy protocol (fixes L1 ordering in one place) create_graph_node_handle already owns Level 1. Extend it to own the full lifecycle:

  create_graph_node(CUgraphNode, GraphHandle) → GraphNodeHandle + Python wrapper
  destroy_graph_node(GraphNodeHandle) → { pop L2, unregister L1, null resource, cuGraphDestroyNode }

All under one C++ mutex. Then Cython GraphNode.destroy() is a one-liner. The invalidate-then-destroy rule becomes impossible to get wrong at call sites.

Templatized Level-2 / as_py identity cache in C++
REGISTRY_DESIGN.md already cites pybind11's registered_instances. A generic:

  template<typename Handle>
  class PyIdentityRegistry {
      // key: owner identity (owner_before / box pointer)
      // value: PyObject* via weakref
  };

could serve GraphNode, Kernel, Event, etc. Key on handle ownership, not raw CU*. Registration happens in the factory; unregistration in the destroy protocol. Cython stops maintaining _node_registry by hand.

as_py(h) in DESIGN.md was meant to return Python wrappers — today GraphNode's as_py still goes to cuda.bindings. Evolving it to return the cached cuda.core object would collapse L2 into the handle layer.

What you still can't fully move out of Cython
• Polymorphic construction: GN_create_impl dispatches to EmptyNode, KernelNode, … C++ either needs a typed factory table or a Cython callback registered at init — still one entry point, but not zero Cython.
• Pure-Python lazy caches: IPCDataForBuffer, DLPack/StridedMemoryView inference, ComputeCapability tuples — these aren't handle metadata; they'll need Python locks or redesign.
• APIs that aren't handle-backed: SMResource.split() — mutex or a documented "caller serializes" contract unless the driver guarantees more.

seberg · 2026-06-17T08:24:22Z

Not sure I am following all of it, but C++ threading does seem much more mature than Python/Cython so pushing more to C++ is probably a good idea.
Some notes/thoughts:

In most cases here the critical_section is on functions that return Python objects (one where it wasn't uses the atomic, I forgot to delete it), so there are two considertions:
1. Creating the object is quick but not free performance wise.
2. It may be nice if things are idempotent which a C++ side re-wrapping can't solve?
  (But: I don't think the @critical_section doesn't guarantee idempotence fully.)
  I am slightly wondering if we couldn't create a tiny pattern ourselves for this but not sure how nice it'll be my best brainstorming is to create a C++ helper (mirroring dicts):
```
cdef atomic_object:
    cdef get(..., order=something):
    cdef set(..., order=something):
    cdef object setdefault(..., order=something):
```
  but if we go down that route it may make more sense to talk to Cython folks. I think Cython support will improve.
Moving the as_py adds the need to deal with cleaning up weakrefs, but I suspect it is a good idea even just to not split the logic.

github-actions Bot added the cuda.core Everything related to the cuda.core module label Jun 2, 2026

seberg commented Jun 2, 2026

View reviewed changes

seberg force-pushed the threading-fixes branch from 304e64e to 3a7b314 Compare June 2, 2026 16:34

seberg self-assigned this Jun 2, 2026

seberg added this to the cuda.core next milestone Jun 2, 2026

seberg added the bug Something isn't working label Jun 2, 2026

leofang modified the milestones: cuda.core next, cuda.core v1.1.0 Jun 5, 2026

leofang added the P0 High priority - Must do! label Jun 5, 2026

leofang requested review from Andy-Jost and leofang June 8, 2026 01:36

seberg mentioned this pull request Jun 10, 2026

Test using pytest-run-parallel and related fixups in the tests #2194

Draft

2 tasks

seberg force-pushed the threading-fixes branch from fe39d2c to 04c3862 Compare June 10, 2026 20:21

seberg added 4 commits June 11, 2026 15:09

Use C++ bool for atomic

f4cb18d

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>

Forgot to commit a critical section on this branch

91cf6d0

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>

Regenerate .pyi files, but I am a bit unsure the critical_section i…

f5b239c

…s good... Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>

seberg force-pushed the threading-fixes branch from 04c3862 to f5b239c Compare June 11, 2026 13:10

seberg marked this pull request as draft June 15, 2026 10:19

Andy-Jost reviewed Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix some threading issues (some free-threading specific)#2162

Fix some threading issues (some free-threading specific)#2162
seberg wants to merge 4 commits into
NVIDIA:mainfrom
seberg:threading-fixes

seberg commented Jun 2, 2026

Uh oh!

copy-pr-bot Bot commented Jun 2, 2026

Uh oh!

seberg Jun 2, 2026

Uh oh!

mdboom commented Jun 10, 2026

Uh oh!

seberg commented Jun 10, 2026 •

edited

Loading

Uh oh!

rparolin commented Jun 12, 2026

Uh oh!

seberg commented Jun 15, 2026

Uh oh!

Andy-Jost left a comment

Uh oh!

seberg commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

seberg commented Jun 2, 2026

Checklist

Uh oh!

copy-pr-bot Bot commented Jun 2, 2026

Uh oh!

seberg Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

mdboom commented Jun 10, 2026

Uh oh!

seberg commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rparolin commented Jun 12, 2026

Uh oh!

seberg commented Jun 15, 2026

Uh oh!

Andy-Jost left a comment

Choose a reason for hiding this comment

Uh oh!

seberg commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

seberg commented Jun 10, 2026 •

edited

Loading